Star Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost. This benefits hotel guests, but it is less desirable and possibly revenue-diminishing for hotels. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. Star Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Read the Dataset

View top and bottom of dataset

Seeing the shape of the data

The data has 56,926 rows and 18 columns.

Look at data statistics

Check for duplicated values

Look at categorical columns info

Categorical conclusions

Exploratory Data Analysis (EDA)

Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Univariate Analysis

Observations on Number of Adults and Children

Observations on Number of Children

Observations on Number of Adults

Observations on Arrival Month

Observations on Arrival Year

Observations on Market Segment Type

Observations on Booking Status

Observations on Repeated Guest

96.9% of guests are new guests. Why are there so few returning customers? Does the customer experience need to improve?

Observations on Meal Plan

Observations on Number of Weekend Nights

Observations on Number of Week Nights

Observations on Parking Space Required

Observations on Room Types

Observations on Number of Special Requests

Observations on Correlation

Bivariate Analysis

Market Segment type vs booking status

Room Type vs Room Price and Cancellation

What are the differences in room prices across the different market segments?

What percentage of repeated guests cancel?
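
One way to answer this question is a grouped cancellation rate. The sketch below is illustrative only: the column names `repeated_guest` and `booking_status`, the label `"Canceled"`, and the ten toy rows are assumptions, not values taken from the Star Hotels data.

```python
import pandas as pd

# Toy data invented for illustration; column names and labels are assumed.
df = pd.DataFrame({
    "repeated_guest": [0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    "booking_status": [
        "Canceled", "Not_Canceled", "Canceled", "Not_Canceled",
        "Not_Canceled", "Not_Canceled", "Canceled", "Canceled",
        "Not_Canceled", "Not_Canceled",
    ],
})

# Mean of a boolean "is canceled" flag per group gives the cancellation
# share; multiplying by 100 turns it into a percentage.
cancel_pct = (
    df["booking_status"].eq("Canceled")
    .groupby(df["repeated_guest"])
    .mean()
    .mul(100)
)
print(cancel_pct)
```

The same pattern would apply to the number of special requests or the market segment by swapping the grouping column.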

Do special requirements affect booking cancellations?

Cancellations vs Arrival Months

Observations on Dates affecting Booking Status

Room Type vs Cancellations

Data Preprocessing

EDA

Checking Multicollinearity

Additional Information on VIF
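
The variance inflation factor of predictor j is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. A minimal NumPy sketch of that definition follows; the helper name `vif` and the synthetic columns are assumptions for illustration, and the notebook itself may have used a library routine instead.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of a 2-D array X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        # With an intercept the residual mean is zero, so this is 1 - SSR/SST.
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic predictors: c is nearly a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

A VIF near 1 indicates little collinearity. Cut-offs of 5 or 10 are common rules of thumb for flagging a predictor, not fixed standards.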

Observations:

Building a Logistic Regression model

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting a customer will contribute to the revenue but in reality the customer would not have contributed to the revenue. - Loss of resources

  2. Predicting a customer will not contribute to revenue but in reality the customer would have contributed to revenue. - Loss of opportunity

Which case is more important?

How to reduce this loss, i.e., how to reduce False Negatives?
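
Reducing false negatives is equivalent to raising recall, since recall = TP / (TP + FN). The sketch below shows the calculation on toy labels; the labels are invented, and treating a canceled booking as the positive class (1) is an assumption made for this example.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels invented for illustration: 1 = canceled, 0 = not canceled.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# For labels ordered [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Recall is the share of actual positives that the model caught.
recall = tp / (tp + fn)
print(tn, fp, fn, tp, recall)

# The library helper computes the same quantity.
recall_sklearn = recall_score(y_true, y_pred)
```

Here two of the four actual positives are missed, so recall is 0.5. A model tuned for higher recall will usually accept more false positives in exchange.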

Logistic Regression Model

Model performance evaluation

First model

The Confusion Matrix

Coefficient Analysis

AUC curve to check model performance

Try to improve the model

Checking new model performance adjusted per threshold from AUC curve
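
A common way to pick a threshold from the ROC curve is to take the point where TPR - FPR is largest. Whether the notebook used this exact rule is an assumption. The sketch below uses synthetic data from `make_classification`, not the hotel data, and evaluates on the training data for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve

# Synthetic, mildly imbalanced data standing in for the real bookings.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.7, 0.3], random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)[:, 1]

# Choose the threshold that maximises TPR - FPR along the ROC curve.
fpr, tpr, thresholds = roc_curve(y, proba)
best = np.argmax(tpr - fpr)
threshold = thresholds[best]

# Compare recall at the default 0.5 cut-off and at the chosen threshold.
recall_default = recall_score(y, (proba >= 0.5).astype(int))
recall_tuned = recall_score(y, (proba >= threshold).astype(int))
print(threshold, recall_default, recall_tuned)
```

In practice the threshold would be chosen on training or validation data and then checked on the held-out test set.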

New model from precision/recall curve

Model Comparison

Final Model Summary

Final Logistic Model Summary (Training vs Test Comparison)

Final Observations on Logistic Model

Building a Decision Tree model

Decision Tree Model Observations

Do we need to prune the tree?

Check on Test set to see if pruning is necessary

Visualizing the Tree

Reducing the overfitting model

Using GridSearch for Hyperparameter tuning of our tree model

Post Prune Decision Tree with Alpha
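
Post-pruning with alpha usually means cost-complexity pruning: compute the candidate `ccp_alpha` values, fit one tree per alpha, and keep the tree that scores best. The sketch below runs on synthetic data. Using recall as the selection metric and selecting on the test split are simplifying assumptions, and a separate validation split would be cleaner.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data standing in for the real bookings.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.1, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Candidate alphas from the pruning path; the last one prunes to the root,
# so it is dropped. Clamp at 0 to guard against tiny negative round-off.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
ccp_alphas = [max(float(a), 0.0) for a in path.ccp_alphas[:-1]]

# Fit one tree per alpha and score each on the held-out split.
trees = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
    for a in ccp_alphas
]
test_recall = [recall_score(y_test, t.predict(X_test)) for t in trees]

best_idx = max(range(len(trees)), key=test_recall.__getitem__)
best_tree, full_tree = trees[best_idx], trees[0]
print(ccp_alphas[best_idx], best_tree.tree_.node_count, full_tree.tree_.node_count)
```

Larger alphas remove more of the tree, so the selected tree never has more nodes than the unpruned tree.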

Model Performance Comparison and Conclusions

Actionable Insights and Recommendations